Impact Statement
Atmospheric aerosols influence the amount of solar radiation reflected by Earth, but the magnitude of the effect is highly uncertain, and this is one of the key reasons why climate predictions are highly uncertain. We propose a framework for reducing uncertainty in aerosol effects on radiation by comparing simulations from complex climate models to satellite observations. This framework uses parallel computing and statistical theory to ensure efficient computations and valid inferences.
1. Introduction
Atmospheric aerosols affect the formation of clouds and scatter and absorb visible radiation, thereby influencing Earth’s climate. Improved estimates of the change in the total effect of aerosols on the climate over the industrial era—a highly uncertain quantity termed the aerosol effective radiative forcing (ERF)—have the potential to reduce uncertainty in the sensitivity of the climate to these aerosols. Recent efforts to constrain ERF have involved first reducing uncertainty in the distributions of more basic aerosol-related physical parameters and then studying the effects of these constraints on ERF. This has been notably done by comparing simulated data with observations collected on various global aircraft and ship campaigns as well as from satellites and ground stations (Johnson et al., 2018, 2020; Regayre et al., 2018, 2020). Toward constraining the aerosol radiative forcing, Regayre et al. (2023) employ simulated outputs from the UKESM1 climate model that are averaged over month-long periods. In contrast, the present work employs three-hourly simulation outputs from the same model, requiring an inferential framework capable of handling this increase in the resolution and quantity of the data.
Establishing constraints on the input parameters of expensive computer models by comparing their outputs with observational data is an area of active research (Biegler et al., 2010). It is often imperative that a surrogate model be trained from an ensemble of model input–output pairs and used in place of the simulator to ensure tractable computations, but this step contributes an additional source of uncertainty. If the outputs, or surrogate outputs, do not match observations within some tolerance, then those parameter values are deemed implausible. In Johnson et al. (2020) and Regayre et al. (2020), the method of history matching is used, a technique from oil reservoir engineering that has been adapted to the evaluation of computer models more generally in recent decades (Verly et al., 1984; Craig et al., 1997; Johansen, 2008; Bower et al., 2010). However, the aim of history matching is to constrain parameter spaces and not necessarily to provide well-understood probabilistic guarantees on those constraints.
In contrast with previous work on constraining climate model parameters, our framework draws on a recent surge of interest in simulator-based inference (Schafer and Stark, 2009; Cranmer et al., 2020; Dalmasso et al., 2023) to produce parameter constraints that provide rigorous statistical guarantees of frequentist coverage. Specifically, our work deals with a special case of simulator-based inference where the observations are given by a deterministic simulator and an additive noise model. Patil et al. (2022) and Stanley et al. (2022) use a strict bounds method (Stark, 1992) to construct efficient confidence sets for the model parameters in closely related inverse problems in remote sensing and high energy physics. Unlike in these works, where the forward models of interest are linear and known exactly, the present problem features a forward model (UKESM1) which is nonlinear and estimated using an emulator. We take advantage of the strict bounds method while inverting the emulated forward model numerically and accounting for emulation uncertainty.
Our framework also offers a novel means of accounting for the systematic disagreement, known as the model discrepancy, between a simulator and the physical system which it purports to model. A number of approaches to accounting for model discrepancy in computer model calibration or simulator-based inference have been developed (Kennedy and O’Hagan, 2001; Higdon et al., 2008; McNeall et al., 2016). We propose a new data-driven procedure for incorporating model discrepancy (and other sources of error that we cannot separately quantify, such as representation errors; Schutgens et al., 2017) into the strict bounds inversion framework. Cloud-based computing resources are leveraged to make each step in the framework computationally scalable.
1.1. Data sources
Aerosol optical depth (AOD) is a measure of the amount of aerosol in the atmosphere. It is measured by the MODerate-resolution Imaging Spectroradiometer (MODIS), found on board the NASA-launched Terra and Aqua satellites, which offer near-global coverage twice daily and provide easily readable open-access data sets. In the flow chart in Figure 1, this data set is the SatelliteTimeseries. For the present application, we focus on MODIS retrievals from the South Atlantic and Central African region over July 1–14, 2017. This domain and period of study are selected for their known biomass burning-related atmospheric aerosol activity.
We compare these data with climate model outputs taken from the UKESM1 model (Sellar et al., 2019). We use simulations that were nudged to the observed meteorology following the method of Telford et al. (2008), so the simulated weather conditions are sufficiently realistic for us to examine the ability of the model to represent the observed aerosols. We use the perturbed parameter ensemble (PPE) of 221 atmosphere-only simulations documented by Regayre et al. (2023), which have a configuration that closely matches the atmosphere component of the UKESM1 model used in the CMIP6 experiments (Sellar et al., 2019).
For each ensemble member, let $ u $ be a vector of parameter inputs that determine climatic aerosol-related processes of practical importance, and let $ x $ be a vector of control variables that define the specific output of the climate model, denoted $ \eta \left(x,u\right) $ , representing the climate observable $ \zeta (x) $ . In our setting, $ x $ denotes a latitude–longitude-time triple in the climate model output’s spatiotemporal grid, denoted $ {\mathcal{M}}_{\mathrm{sim}} $ and called the SimulatedGrid in Figure 1. The parameters $ u $ are listed in Table 1. The ensemble is a set of simulations, in notation
Note: $ {N}_d $ is the cloud droplet number concentration. SF is short for scale factor.
For reference, this and some later notation used throughout this paper are summarized in Table 2.
1.2. Problem setup
In order to emulate the model output for unobserved parameter values, we assume as in Johnson et al. (2020) that at $ x\in {\mathcal{M}}_{\mathrm{sim}} $ , the model output is a realization of a Gaussian process,
For each $ x $ , we train the surrogate model $ {\tilde{\eta}}_x $ as described in Section 2.1.
Let $ z(x) $ denote the AOD observations and $ {u}^{\ast } $ the true parameter value. Assuming that the emulated climate model is unbiased, these observations and parameters are related according to the equations
The different sources of variance in the measurements—namely, the measurement uncertainty ( $ {\varepsilon}_{\mathrm{meas},x} $ ), the emulation uncertainty ( $ {\varepsilon}_{\mathrm{emu},x} $ ), and any other sources ( $ {\varepsilon}_{\mathrm{other},x} $ ) that we do not quantify separately, including uncertainty due to model discrepancy and measurement representativeness error—are assumed to be mean zero and independent across $ x $ . This is a simplifying assumption in that, in reality, these terms might be correlated across $ x $ . By further assuming that $ {\varepsilon}_{\mathrm{meas},x} $ and $ {\varepsilon}_{\mathrm{other},x} $ are Gaussian, the observations $ z(x) $ are jointly normally distributed across the $ x $ on the spatiotemporal grid $ {\mathcal{M}}_{\mathrm{sat}} $ (SatelliteGrid) on which the MODIS retrievals are gridded. In particular,
where $ {\Sigma}_{\mathrm{meas}} $ and $ {\Sigma}_{\mathrm{emu}} $ are covariance matrices such that entry $ {\Sigma}_{\mathrm{meas},i,j}=\mathrm{Var}\left({\varepsilon}_{\mathrm{meas},{x}^i}\right) $ is the measurement uncertainty at location $ {x}^i $ when $ i=j $ , otherwise zero (i.e., it is a diagonal matrix and the measurement errors are assumed to be uncorrelated between locations); $ {\Sigma}_{\mathrm{emu},i,j}=\mathrm{Var}\left({\varepsilon}_{\mathrm{emu},{x}^i}\left({u}^{\ast}\right)\right) $ is the emulation uncertainty when $ i=j $ , otherwise zero; and $ {\delta}^2=\mathrm{Var}\left({\varepsilon}_{\mathrm{other},\mathrm{x}}\right) $ is a homoscedastic variance term standing in for all other unaccounted-for errors. The matrix $ {I}_{\mid {\mathcal{M}}_{\mathrm{sat}}\mid } $ is an identity matrix with its number of rows equal to the size of the set $ {\mathcal{M}}_{\mathrm{sat}} $ . The disagreement between grids $ {\mathcal{M}}_{\mathrm{sim}} $ and $ {\mathcal{M}}_{\mathrm{sat}} $ is addressed in the following section. The $ {\Sigma}_{\mathrm{emu},i,i} $ are modeled by the surrogate model (see Section 2.1). The $ {\Sigma}_{\mathrm{meas},i,i} $ are the published MODIS uncertainties, which may not account for all possible problems with the retrievals, but these unaccounted-for uncertainties will be captured by $ {\delta}^2 $ , which is estimated from the observations (see Section 2.4).
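As a concrete illustration of this error model, the sketch below simulates observations under the diagonal-covariance assumption; the grid size and emulated means are synthetic stand-ins, while the variance magnitudes match those reported in Section 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                              # illustrative |M_sat|
eta_hat = rng.uniform(0.1, 0.6, m)   # emulated AOD at the true parameter u*
sig2_meas = 0.027 * np.ones(m)       # published MODIS variances (diagonal Sigma_meas)
sig2_emu = 0.038 * np.ones(m)        # emulator predictive variances (diagonal Sigma_emu)
delta2 = 0.025                       # homoscedastic variance of eps_other

# With all three covariance terms diagonal, z is a vector of independent
# Gaussians whose per-cell variance is the sum of the three components.
z = rng.normal(eta_hat, np.sqrt(sig2_meas + sig2_emu + delta2))
```

Because the total covariance is diagonal, sampling and likelihood evaluations scale linearly in $ \mid {\mathcal{M}}_{\mathrm{sat}}\mid $ rather than cubically.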
2. Inference Framework
We used the C3.AI Suite, a cloud computing platform for data analytics workflows deployed on Microsoft Azure infrastructure (C3 Enterprise AI, n.d.). The platform combines databases, open-source packages, and proprietary machine-learning workflows optimized for large-scale, data-intensive applications. We built new data structures and methods for processing NetCDF-4 files containing high-dimensional time-series datasets. We also developed a scalable inference pipeline for training and predicting through several thousand Gaussian process models using asynchronous processing such as parallel batch and map-reduce jobs. This pipeline is summarized in Figure 1 and complements other recently published workflows for similar tasks (e.g., Watson-Parris et al., 2021).
The grid of satellite measurements is finer than the grid of simulations. The raw MODIS retrievals are pre-processed algorithmically by the MODIS team before they are provided on a $ 1{}^{\circ}\times 1{}^{\circ } $ spatial grid in the Level-3 Global Gridded Atmosphere Product (Hubank et al., 2020). However, the model outputs are on a $ 1.35{}^{\circ}\times 1.875{}^{\circ } $ grid. To reconcile these differences, we match simulated grid cells to their nearest neighbor on the Level-3 MODIS product grid in space and time. Differences are computed on the resulting $ 1.35{}^{\circ}\times 1.875{}^{\circ } $ resolution MatchedGrid, denoted $ \mathcal{M} $ .
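The nearest-neighbor matching step can be sketched as follows; the domain bounds are illustrative, and for simplicity the sketch matches latitude and longitude independently rather than jointly in space and time.

```python
import numpy as np

# Illustrative 1D grid axes: simulated (1.35 x 1.875 deg) vs Level-3 (1 x 1 deg).
sim_lat = np.arange(-30.0, 10.0, 1.35)
sim_lon = np.arange(-20.0, 40.0, 1.875)
sat_lat = np.arange(-30.0, 10.0, 1.0)
sat_lon = np.arange(-20.0, 40.0, 1.0)

# Match each simulated cell to its nearest Level-3 cell, treating latitude
# and longitude independently (a simplification of the matching in the text).
lat_idx = np.abs(sim_lat[:, None] - sat_lat[None, :]).argmin(axis=1)
lon_idx = np.abs(sim_lon[:, None] - sat_lon[None, :]).argmin(axis=1)
# Differences are then computed on the coarser 1.35 x 1.875 MatchedGrid.
```

Matching to the coarser grid means every simulated cell receives exactly one satellite value, so no simulated output is duplicated or discarded.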
2.1. Emulate: Inside the EmulatorTraining pipe
As indicated in Eq. (1), we assume the climate model is a realization of a Gaussian process where for each $ x\in \mathcal{M} $ the mean and anisotropic exponential covariance functions are
We choose $ {k}_x $ because a relatively rough process appears reasonable for this problem. We make the model anisotropic by fitting a different length scale parameter $ {\mathrm{\ell}}_{i,x} $ for each model parameter $ {u}_i $ . By fitting $ {\mathrm{\ell}}_{i,x} $ separately for each $ x $ , we let the parameters’ effects on the model output vary geographically.
We train the emulating Gaussian processes by estimating their parameters on $ {D}_{\mathrm{train}} $ using maximum likelihood (Rasmussen and Williams, 2006) with the scikit-learn Python library and the L-BFGS-B optimization algorithm, and thus we obtain a collection of models $ {\tilde{\eta}}_x $ , $ x\in \mathcal{M} $ . Figure 2 illustrates the quality of these emulators by verifying that the emulated AOD varies with respect to active parameters exactly as the training data set would suggest. In terms of scalability, an ordinary Gaussian process model on the entire ModelTimeseries would compute in $ O\left({\left(|\mathcal{M}|n\right)}^3\right) $ time, $ n $ being the number of members in the PPE, whereas the EmulatorTraining pipe trains in $ O\left(|\mathcal{M}|{n}^3\right) $ time. This routine is parallelized by distributing batches of training jobs to independent worker nodes on the C3.AI Suite cluster, substantially reducing the wall-clock time.
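A minimal sketch of the per-grid-cell training step, using a synthetic stand-in for the AOD response at a single grid cell and 3 parameters instead of 17 (in scikit-learn, the Matern kernel with $ \nu =0.5 $ is the anisotropic exponential covariance):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

rng = np.random.default_rng(0)
d = 3                                 # illustrative; the study perturbs 17 parameters
U_train = rng.uniform(size=(221, d))  # PPE design, one row per ensemble member
y_train = np.sin(3 * U_train[:, 0]) + 0.1 * U_train[:, 1]  # stand-in AOD at one grid cell x

# Matern with nu = 0.5 is the exponential covariance; one length scale per
# model parameter gives the anisotropy described above.
kernel = ConstantKernel(1.0) * Matern(length_scale=np.ones(d), nu=0.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)  # MLE via L-BFGS-B by default
gp.fit(U_train, y_train)

mean, sd = gp.predict(U_train[:5], return_std=True)  # emulated mean and predictive sd
```

Each grid cell $ x $ gets its own independent fit of this form, which is what makes the batch-parallel distribution across worker nodes straightforward.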
2.2. Predict: Inside the EmulatorEvaluation pipe
We uniformly sample a collection of 5,000 new parameter vectors $ {u}^k $ (in contrast with the 221 training vectors) within the ranges given in Table 1 to obtain a testing sample
This set pairs points in the spatiotemporal-parametric space with the emulated mean and variance of the AOD response, specifying a distribution closely mimicking the model response surface $ \eta \left(x,u\right) $ . We expect that this number of sampling points $ {u}^k $ is sufficient to reliably constrain a small number of parameters. These predictions are performed efficiently by a map-reduce job across the collection of models $ {\tilde{\eta}}_x $ .
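The sampling and prediction ("map") step can be sketched as below; the parameter ranges, grid cells, and emulator are illustrative stand-ins for Table 1 and the trained models $ {\tilde{\eta}}_x $ .

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3                             # illustrative; the actual study samples 17 parameters
lo, hi = np.zeros(d), np.ones(d)  # stand-ins for the ranges in Table 1
U_test = lo + (hi - lo) * rng.uniform(size=(5000, d))  # uniform design over the box

# Stand-in for a trained emulator: any callable returning (mean, sd) arrays.
def emulator_stub(U):
    return U.sum(axis=1), 0.1 * np.ones(len(U))

# "Map" step: each grid cell's emulator predicts for every sampled u^k.
# In the pipeline this loop is dispatched as a parallel map-reduce job.
grid_cells = ["x1", "x2", "x3"]
D_test = {x: emulator_stub(U_test) for x in grid_cells}
```

Because the emulators are independent across $ x $ , the map step is embarrassingly parallel and the reduce step only has to assemble the per-cell means and variances.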
The surrogate model appears to perform well at most locations in the spatio-temporal domain. However, gross discrepancies between the observations and the model output arise in a small fraction of the grid points which cannot be accounted for using our model discrepancy term (described later). In particular, when we consider the distance metric
we identify about 2% of the grid points for which $ {J}_{x,\delta}\left({u}^k\right) $ is a visible outlier for all $ {u}^k $ , $ k=1,2,\dots, \mathrm{5,000} $ . The parameter $ \gamma $ here was tuned visually based on QQ plots since the other variance parameter $ \delta $ has not yet been estimated at this stage. Let $ {\mathcal{M}}^{\ast } $ be the remaining coordinates in the spatiotemporal grid that have not been excluded either as outliers or due to missingness.
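Since the exact form of $ {J}_{x,\delta}\left({u}^k\right) $ is not reproduced in this excerpt, the sketch below assumes a plausible standardized-distance form simply to illustrate the screening logic: a grid point is excluded only when even the best-fitting $ {u}^k $ leaves a large standardized residual.

```python
import numpy as np

rng = np.random.default_rng(5)
n_x, n_k = 200, 50                            # grid cells and sampled parameter vectors
z = rng.normal(0.3, 0.1, size=(n_x, 1))       # gridded AOD observations
mean = rng.normal(0.3, 0.1, size=(n_x, n_k))  # emulated means for each u^k
var = 0.05 * np.ones((n_x, n_k))              # sigma2_meas + sigma2_emu per (x, u^k)
gamma = 0.01                                  # visually tuned variance inflation

# Assumed standardized-distance form of J_{x,delta}(u^k); not the paper's
# exact expression, which is omitted from this excerpt.
J = np.abs(z - mean) / np.sqrt(var + gamma)

# Exclude a grid point only if J is an outlier for ALL u^k: even the
# best-fitting parameter vector cannot reconcile emulator and observation.
thresh = 3.0
keep = J.min(axis=1) <= thresh
```

Minimizing over $ k $ before thresholding is what makes the screen conservative: it can only remove cells that no sampled parameter vector can explain.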
2.3. Discrepancy: Inside the DataDrivenModelDiscrep pipe
The quantity $ {\delta}^2 $ as seen in (2) is the variance of the unaccounted-for uncertainty $ {\varepsilon}_{\mathrm{other},x} $ in our model, which we estimate using maximum likelihood. We write down the likelihood for unknowns $ u $ and $ {\delta}^2 $ from our Gaussian assumption,
where
To numerically obtain the maximum likelihood estimate for $ {\delta}^2 $ , we compute the maximizing value $ {\hat{\delta}}_k^2 $ of $ \log L $ for each of the test parameters $ {u}^k $ using the scipy Python library’s implementation of Brent’s algorithm (Press et al., 1992); then among these maximizing values we select the one which gives the overall maximum likelihood over $ k=1,2,\dots, \mathrm{5,000} $ . The resulting estimate $ {\hat{\delta}}_{\mathrm{MLE}}^2={\hat{\delta}}_{\hat{k}}^2 $ , where $ \hat{k}=\arg \hskip0.1em {\max}_k\log L\left({u}^k,{\hat{\delta}}_k^2;{D}_{\mathrm{train}}\right) $ , is approximate only insofar as the search over the parameter space is finite.
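The profile-then-maximize step can be sketched for a single $ {u}^k $ ; the residuals and variances below are synthetic, and scipy's bounded scalar minimizer (a Brent-type method) plays the role of Brent's algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Gaussian log-likelihood in delta^2 for one fixed u^k, with diagonal
# covariance diag(sigma2_meas + sigma2_emu(u^k)) + delta^2 * I.
def log_lik(delta2, resid, s2):
    var = s2 + delta2
    return -0.5 * np.sum(np.log(2 * np.pi * var) + resid**2 / var)

rng = np.random.default_rng(2)
s2 = 0.03 * np.ones(200)                      # sigma2_meas + sigma2_emu per grid cell
resid = rng.normal(scale=np.sqrt(s2 + 0.02))  # z(x) minus emulated mean; true delta^2 = 0.02

# Bounded scalar maximization of the log-likelihood in delta^2.
res = minimize_scalar(lambda d2: -log_lik(d2, resid, s2),
                      bounds=(0.0, 1.0), method="bounded")
delta2_hat = res.x  # estimate for this u^k; repeat over all u^k and keep the overall argmax
```

Since the optimization is one-dimensional per $ {u}^k $ , the 5,000 scalar searches are cheap relative to the Gaussian process computations elsewhere in the pipeline.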
This part of the pipeline runs quickly. Evaluating the expression for the likelihood and performing the optimization routine took about 5 minutes of wall-clock time in our case and can be performed in local memory. Our value for the estimator on the remaining data was $ {\hat{\delta}}_{\mathrm{MLE}}^2=0.025 $ , which is of similar magnitude as the average measurement variance of $ {\overline{\sigma}}_{\mathrm{meas}}^2=0.027 $ and emulation variance of $ {\overline{\sigma}}_{\mathrm{emu}}^2\left(\hat{u}\right)=0.038 $ .
2.4. Test: Inside the PlausibilityTest pipe
To obtain a confidence set for the underlying atmospheric parameters, we perform a test for parameter plausibility using MODIS AOD observations on the MatchedGrid. Following the terminology used by Johnson et al. (2020), we write down the implausibility metric
For each $ u $ in $ {D}_{\mathrm{test}} $ , we compare our observed implausibility measure against its approximate distribution under the null hypothesis that $ u $ is the correct parameter, $ {H}_0:I(u)\sim \sqrt{\chi^2\left( df=|{\mathcal{M}}^{\ast }|\right)} $ . Here we use the fact that the sum of $ n $ squared independent standard Gaussian random variables is distributed as $ {\chi}^2\left( df=n\right) $ , and we assume that $ n $ is large enough that we can ignore the variation in the model-discrepancy variance estimate $ {\hat{\delta}}_{\mathrm{MLE}}^2 $ . In other words, testing at the 0.05 significance level, a parameter vector $ u $ will be deemed implausible if $ I(u) $ exceeds the $ 95 $ th percentile of the above null distribution.
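The test can be sketched as follows, assuming the implausibility metric takes the root-sum-of-squares form of standardized residuals that the stated null distribution implies; $ |{\mathcal{M}}^{\ast }| $ and the variances are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Assumed root-sum-of-squares implausibility over the retained grid M*.
def implausibility(resid, var_total):
    return np.sqrt(np.sum(resid**2 / var_total))

m = 500                               # illustrative |M*|
crit = np.sqrt(chi2.ppf(0.95, df=m))  # 95th percentile of sqrt(chi2(df=m))

rng = np.random.default_rng(3)
var_total = 0.05 * np.ones(m)                 # sigma2_meas + sigma2_emu + delta2_hat
resid = rng.normal(scale=np.sqrt(var_total))  # residuals drawn under H0
plausible = implausibility(resid, var_total) <= crit  # failing to reject keeps u in the set
```

By construction, a correct parameter vector is rejected with probability 0.05, which is what later yields the 95% coverage of the inverted set.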
A method for obtaining confidence sets inspired by the application of history matching in Johnson et al. (2020) can also be derived. We find that inference based on this method is sensitive to the choice of tolerance level that is explicit in their choice of implausibility statistic. For a discussion of how our test differs from this instance of history matching, see the Appendix in the Supplementary material.
2.5. Infer: Inside the FrequentistConfSet pipe
Having obtained a collection of test results for each $ {u}^k $ , $ k=1,2,\dots, \mathrm{5,000} $ , we approximate the Neyman inversion (Dalmasso et al., 2023) of the test for plausibility described above by retaining all those parameters $ {u}^k $ for which we do not reject the null at the 0.05 significance level, obtaining an approximate $ 95\% $ confidence set. This is the region of the 17-dimensional parameter space which exclusively contains non-implausible parameter vectors. A two-dimensional projection of this set is shown on the right of Figure 3.
3. Results
Using our strict bounds-based test for parameter plausibility, we obtain a simultaneous 95% confidence set on the selected climate model parameters. We find that large values of the sea spray emission flux parameter are on the verge of being constrained with just 2 weeks of AOD data, shown in the upper left panel of Figure 3. However, the two plots on the bottom left show that for no level of either BVOC SOA or the accumulation mode dry deposition rate does our test for plausibility always fail, and so formal constraints on these cannot be obtained. To illustrate this point, suppose that the value of the accumulation dry deposition rate parameter (bottom left of Figure 3) was wrongly set to 1 when the true value is 10. Then there are combinations of the other 16 selected parameters that enable us to fit the model outputs to the observed AOD data within the modeled uncertainties. Hence we cannot rule out value 1 for the accumulation dry deposition rate parameter, and likewise for the other values for each parameter. If evaluated at a lower confidence level, our results seem broadly consistent with Regayre et al. (2023), who constrain the dry deposition rate toward large values, and Regayre et al. (2020) and Johnson et al. (2020), who also constrain the sea spray emission parameter.
We are able to obtain a constraint at the 95% confidence level on the combination of the deposition rate of dry aerosols in the accumulation mode and biogenic secondary organic aerosol from volatile organic compounds. See the right panel of Figure 3 for a binned projection of the resulting confidence set onto their span. The resulting constraint has the following physical meaning: if one hypothesizes that vegetation emits a large amount of aerosol while the deposition rate for relatively large particles is low, then the model overestimates AOD by so much that, even after accounting for the uncertainty of the MODIS retrievals due to instrumental error, the imperfection of our surrogate model, and the other sources of model discrepancy that we estimate for our climate model, a significant region of this parameter subspace can be ruled out as implausible at the 95% confidence level.
4. Conclusion
To the best of our knowledge, this is the first use of simulated AOD from a climate model at a time resolution as high as three-hourly to obtain observation-based constraints on input parameters likely to regulate AOD. That a salient and meaningful constraint has been gleaned from just 2 weeks’ worth of data is suggestive of promising uses of high time resolution data in the future. Our framework is well suited for problem settings where a perturbed parameter ensemble is available for one’s climate simulator and where Gaussian process emulation is appropriate. Notably, any unquantified sources of uncertainty in this setting are accounted for by the data-driven model discrepancy built into the presented pipeline, an aspect that differs from other recent scalable frameworks for model calibration, such as ESEm (Watson-Parris et al., 2021). Our approach assumes that the model discrepancy can be captured by an additive Gaussian error that is independent across space and time, so our constraints rely on these assumptions being at least approximately satisfied. A fundamental difference between our approach and the previous history matching approach as done by Johnson et al. (2020) is that our method provides frequentist confidence sets with well-defined probabilistic guarantees. In addition, as described in the Appendix in the Supplementary material, our method has a potential advantage over Johnson et al. (2020) in that the latter is sensitive to the tuning of its tolerance level. The computational cost of our pipeline is dominated by the Gaussian process computations, so the approach is computationally feasible as long as enough parallel processing resources are available for training and evaluating the pixel-wise emulators.
There are limitations to what can be achieved: with an imperfect and over-parameterized model, constraints can become inconsistent, or different parameter combinations can yield the same simulated ERF (Lee et al., 2016). However, employing more quantities for which we have observational and simulated data would nonetheless allow us to constrain a larger number of these parameters and get the most stringent constraints on aerosol radiative forcing we can. Other observable atmospheric quantities such as sulfates or organic carbon are sensitive to different sets of atmospheric parameters than those to which aerosol optical depth is sensitive, yielding potential opportunities to further constrain the parameter space using larger, more diverse observational data sets.
Acknowledgements
We are grateful to the two anonymous referees for the insightful comments we received through the peer review process for this manuscript. We also thank the Carnegie Mellon University Statistical Methods for the Physical Sciences (STAMPS) Research Group for helpful discussions and feedback at various stages throughout this work.
Author contribution
Conceptualization: H.G., M.K.; Formal analysis: J.C.; Methodology: J.C., M.K., K.C., L.R., H.G.; Climate simulations: L.R., K.C., L.D., P.S.; Analysis software: B.A., J.C.; Writing–original draft: J.C. Writing–review and editing: J.C., M.K., H.G., L.R., P.S.
Competing interest
The authors declare that no competing interests exist.
Data availability statement
The MODIS AOD measurements supporting the results of this work can be accessed through the MODIS web page: https://modis.gsfc.nasa.gov/data/dataprod/mod04.php (last access: 17 January 2023). Output from the A-CURE PPE is available on the CEDA archive (Regayre et al., 2023). The code to reproduce results from this paper can be found on GitHub: https://github.com/c3aidti/smoke/tree/main/climateInformatics2023 (last version: 31 March 2023).
Funding statement
J.C., M.K., and H.G. acknowledge grant funding from the C3.AI Digital Transformation Institute. M.K. was also supported in part by NSF grants DMS-2053804 and PHY-2020295, and H.G. by NASA Grant No. 80NSSC21K1344. L.R. and K.S. acknowledge funding from NERC under grants AEROS, ACID-PRUF, GASSP, and A-CURE (NE/G006172/1, NE/I020059/1, NE/J024252/1, and NE/P013406/1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Provenance statement
This article is part of the Climate Informatics 2023 proceedings and was accepted in Environmental Data Science on the basis of the Climate Informatics peer review process.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/eds.2023.12.